Introduction to Open Data Science - Course Project

This course has three big themes:

  • open data
  • open science
  • data science

After completing the course students

  • understand possibilities of open science, open data, and open research tools
  • master state-of-the-art, open and free software tools: RStudio and GitHub
  • write dynamic reports (text, code, graphs, tables, etc.) in R Markdown
  • apply the principle of reproducible research and see its practical advantages
  • visualize and analyze data sets with some fairly advanced statistical methods
  • learn much more of all this in the future (continuous learning)

My introduction

I’m a PhD Candidate at the University of Eastern Finland (Doctoral Programme of Clinical Research) doing pharmacogenomic research. We use population-level health registry and genomic data in our research, and I’m expecting to learn new tools and tips for coding and statistical analysis of big data during this course. A couple of other PhD students recommended this course as useful.

My GitHub repository can be found here: https://github.com/katjahakki/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Dec  5 19:15:17 2021"

Regression and model validation

1. Structure of the data

The original data file contains JYTOPKYS2 survey data concerning learning and teaching in a statistics course; a reference to the data source can be found here: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt. A new analysis dataset, learning2014, was created with the variables gender, age, attitude, deep, stra, surf and points by combining questions from the original data file. All combination variables were scaled to the original scales by taking the mean, and observations with zero exam points were excluded.

The learning2014 dataset is a data frame with 166 observations and 7 variables. The observations represent students in the statistics course (rows) and each column contains the values of one variable.

Meaning of the variable abbreviations:
attitude = attitude towards statistics
deep = deep learning
stra = strategic learning
surf = surface learning
points = total points of the course exam

### RStudio exercise 2 - The Analysis exercise
date()
## [1] "Sun Dec  5 19:15:17 2021"
# read in data,  learning2014.csv file, using read.csv()
learning2014 <- read.csv('~/IODS-project/data/learning2014.csv')

# show first 6 observations of the data
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
# data structure
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# data dimensions, rows and columns
dim(learning2014)
## [1] 166   7
# number of rows in the data
nrow(learning2014)
## [1] 166
# number of columns in the data
ncol(learning2014)
## [1] 7

The summary() function is a generic R function that produces basic statistics of the data. It is also used to produce result summaries of various model fitting functions.

# summaries, basic statistics of the variables
summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# distribution of female and male students
summary(as.factor(learning2014$gender))
##   F   M 
## 110  56

The majority of the students are female. The summary function shows the minimum, first quartile, median, third quartile and maximum values of the numeric variables.

2. Graphical overview of the data

Let’s draw a scatter plot matrix of the data.

# change variable gender from character to factor in order to draw a scatter plot
learning2014$gender <- as.factor(learning2014$gender)

# draw a scatter plot matrix of the variables in learning2014
# [-1] excludes the first column (gender)
pairs(learning2014[-1], col=learning2014$gender)

Scatter plot matrices are a way to roughly determine whether there is a linear correlation between multiple variables. The variables lie on the diagonal from top left to bottom right, and each pair of variables is plotted against each other. The boxes on the upper right of the scatter plot are mirror images of the plots on the lower left.

Let’s create a more advanced plot matrix of the data.

# access the GGally and ggplot2 libraries
#install.packages("GGally")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create plot matrix with ggpairs()
# Adjust the argument mapping of ggpairs() by defining col = gender inside aes().
# Adjust the code a little more: add another aesthetic element alpha = 0.3 inside aes().
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The plot matrix can be used to visualize the relationships of several variables in a single plot. Scatter plots of each pair of numeric variables are drawn on the lower left part of the plot, Pearson correlations are shown on the upper right and variable distributions are displayed on the diagonal. This plot shows all data separately for female (red) and male (green) students. The top row shows boxplots and the leftmost column shows histograms of the variables. The boxplots display the distribution of the data based on a five-number summary: minimum, first quartile, median, third quartile and maximum. Outliers are shown as points outside the boxes.

The strongest correlation is between the points and attitude variables. Looking at the diagonal distributions, the age variable has a right-skewed distribution; the students are mainly young. The bar plot in the top left corner shows the gender distribution: female students are the majority group.

Let’s draw a scatter plot of student’s attitude versus exam points with a regression line.

# Access the ggplot2 library
library(ggplot2)

# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender))

# define the visualization type (points)
p2 <- p1 + geom_point()

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

# add a main title and draw the plot
p4 <- p3 + ggtitle("Student's attitude versus exam points")

# draw the plot
p4
## `geom_smooth()` using formula 'y ~ x'

The plot and the regression line indicate a linear relationship between student’s attitude and exam points for both female and male students.

3. Fitting a regression model

Simple linear regression models the relationship between the magnitude of one variable and that of a second - for example, as X increases, Y also increases. Or as X increases, Y decreases. Regression quantifies the nature of the relationship between the variables. The simple linear regression estimates how much Y will change when X changes by a certain amount. With regression, we predict the Y variable from X using a linear relationship.

When there is more than one explanatory variable in the linear model, it is called multiple regression. The response variable is assumed to be normally distributed with a mean that is a linear function of the explanatory variables and a variance that is independent of them.
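
The idea above can be sketched with simulated data (not the course data, and all names here are illustrative): a response is generated as a linear function of two explanatory variables plus normally distributed noise, and lm() recovers the coefficients.

```r
# Minimal multiple regression sketch on simulated data
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
# true model: intercept 10, slopes 3 and -2, normal errors with sd = 2
y  <- 10 + 3 * x1 - 2 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates should land close to the true values 10, 3 and -2
```

With only 100 observations the estimates will not match the true values exactly, but they should be well within a standard error or two of them.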

Let’s fit a multiple linear regression model where exam points is the target variable (Y) and age, attitude and stra are the explanatory variables (X variables) and then print out the summary of the fitted model.

# fit a regression model, exam points is the target (dependent) variable
my_model <- lm(points ~ age + attitude + stra,  data = learning2014)

# summary of the fitted model
summary(my_model)
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## age         -0.08822    0.05302  -1.664   0.0981 .  
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The question is: do the age, attitude and stra variables explain the number of exam points?

We can see from the summary that the attitude variable has a statistically significant p-value, meaning that attitude has a statistically significant relationship with the target variable points. The variables age and stra are not statistically significant (p-values above 0.05), meaning that these variables do not have a statistically significant relationship with the target variable points.

Let’s remove the age and stra variables from the model and fit it again without them. Points is the target variable and attitude is the explanatory variable.

# fit a regression model with statistically significant explanatory variable
my_model2 <- lm(points ~ attitude,  data = learning2014)

4. Interpreting the model parameters

Let’s print the summary of the simple regression model.

# summary of the fitted my_model2
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In the summary statistics of the regression model, Call shows what function and parameters were used to create the model. Residuals show the difference between what the model predicted and the actual value of y. The coefficient table contains two rows, the intercept and the slope. The intercept is the expected mean value of Y when X = 0. The coefficient standard error measures the average amount that the coefficient estimates vary from the actual average value of the target variable. The coefficient t-value measures how many standard deviations the coefficient estimate is away from 0. The Pr(>|t|) value in the model output is the probability of observing any value equal to or larger than |t|.

A small p-value indicates that it is unlikely to observe such a relationship between the explanatory and target variables by chance. Three stars in the ‘Signif. codes’ row represent a highly significant p-value (< 0.001).

We can see an estimate of 3.5 for the attitude variable, which means that if the attitude-towards-statistics score increases by 1, the student’s exam points are expected to increase by 3.5 points.

Residual standard error is a measure of the quality of a linear regression fit, and the F-statistic indicates whether there is a relationship between the predictor and the response variables. Multiple R-squared ranges from 0 to 1 and measures the proportion of variation in the data that is accounted for by the model; it is used to assess how well the model fits the data. In this case, 19.06% of the variation in exam points is explained by the variation in attitude towards statistics.
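
As a sanity check on this interpretation, multiple R-squared can be computed by hand as 1 minus the residual sum of squares over the total sum of squares. A minimal sketch on simulated data (not the course data):

```r
# R-squared computed by hand equals summary()'s Multiple R-squared
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
r2  <- 1 - rss / tss

all.equal(r2, summary(fit)$r.squared)  # TRUE
```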

5. Diagnostic plots of the regression model

Let’s draw diagnostic plots for the regression model.

# Produce the following diagnostic plots: Residuals vs Fitted values,
# Normal QQ-plot and Residuals vs Leverage

par(mfrow = c(2,2))

plot(my_model2, which = c(1, 2, 5))

An assumption in a linear regression model is linearity: the target variable is modeled as a linear combination of the model parameters. Usually it is assumed that the errors are normally distributed. By analyzing the residuals of the model we can explore the validity of the model assumptions. It is also assumed that the errors are not correlated, that they have constant variance and that the size of a given error does not depend on the explanatory variables.

Residuals vs Fitted values
The constant variance assumption implies that the size of the errors should not depend on the explanatory variables. This can be checked with a scatter plot of residuals versus model predictions. A residual measures how well the line fits an individual data point: the closer a data point’s residual is to 0, the better the fit. The linearity assumption seems to mainly hold, as the red line stays close to the dashed line in the plot. Potentially problematic cases are the rows numbered 35, 56 and 145.
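
The residuals used in this plot are simply the observed values minus the fitted values; a minimal sketch on simulated data (not the course data) confirms this:

```r
# residuals(fit) is exactly the observed y minus the fitted values
set.seed(2)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

all.equal(unname(residuals(fit)), unname(y - fitted(fit)))  # TRUE
```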

Normal QQ-plot, normality of the errors
A QQ-plot of the residuals is a method to explore the assumption that the errors of the model are normally distributed. In the plot the data points approximately follow the dashed line, supporting the assumption that the residuals are normally distributed. Potentially problematic cases are the rows numbered 35, 56 and 145.

Residuals vs Leverage, leverage of observations
Leverage measures how much impact a single observation has on the model, and the Residuals vs Leverage plot can be used to identify observations with unusually high impact. In the plot most of the cases are well inside the Cook’s distance lines and the red line follows the dashed line quite well, so no single observation seems to have a major impact on the model. Data points outside a dashed Cook’s distance line may be influential to the regression results. Potentially problematic cases are the rows numbered 35, 56 and 71.
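
The quantities behind the Residuals vs Leverage plot can also be extracted numerically with hatvalues() and cooks.distance(). A minimal sketch on simulated data (not the course data), with one outlier planted on purpose:

```r
# leverage and Cook's distance pick out an influential observation
set.seed(3)
x <- rnorm(40)
y <- x + rnorm(40)
y[40] <- y[40] + 10            # plant one influential outlier
fit <- lm(y ~ x)

lev <- hatvalues(fit)          # leverage of each observation
cd  <- cooks.distance(fit)     # Cook's distance of each observation
which.max(cd)                  # the planted outlier should stand out
```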


Logistic regression

1. Structure of the data

The data are from two identical questionnaires related to secondary school student alcohol consumption in Portugal. The data source is the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/dataset). Metadata is available at: https://archive.ics.uci.edu/ml/datasets/Student+Performance.

A new analysis dataset, pormath, was created by joining the math course and Portuguese language course datasets. The two datasets were joined using all variables other than “failures”, “paid”, “absences”, “G1”, “G2” and “G3” as (student) identifiers.

## RStudio Exercise 3: Analysis
date()
## [1] "Sun Dec  5 19:15:26 2021"
# read in the joined student alcohol consumption data pormath, using read.csv()
pormath <- read.csv('~/IODS-project/data/pormath.csv')

# introduction of the dataset pormath, structure of the data
str(pormath)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "no" "no" ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : chr  "yes" "no" "no" "no" ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : chr  "yes" "no" "yes" "yes" ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
# print out names of the variables
colnames(pormath)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"
# rows and columns in the dataset
dim(pormath)
## [1] 370  51

The pormath dataset is a data frame with 370 observations (rows) and 51 variables (columns). The observations represent students and each column contains the values of one variable.

2. Variables for logistic regression analysis

Let’s choose 4 variables and create hypotheses for logistic regression.

The variables:
freetime: free time after school (numeric: from 1 - very low to 5 - very high)
absences: number of school absences (numeric: from 0 to 93)
age: student’s age (numeric: from 15 to 22)
goout: going out with friends (numeric: from 1 - very low to 5 - very high)

The hypotheses:
1: students with more free time have a higher probability of high alcohol consumption than students with less free time
2: students with more absences have a higher probability of high alcohol consumption than students with fewer absences
3: older students have a higher probability of high alcohol consumption than younger students
4: students who go out more with friends have a higher probability of high alcohol consumption than students who go out less with their friends

3. Numerical and graphical overview of the variables

A numerical overview of the data can be obtained, for example, by exploring the frequencies of the variables in cross tables and the basic summary statistics of the variables using the table() and summary() R functions.

## frequencies of the variables, cross tables

# low and high alcohol consumption
summary(as.factor(pormath$high_use))
## FALSE  TRUE 
##   259   111
# alcohol consumption and freetime 
table(high_use = pormath$high_use, freetime = pormath$freetime)
##         freetime
## high_use   1   2   3   4   5
##    FALSE  15  46 112  65  21
##    TRUE    2  14  40  40  15
# alcohol consumption and absences
table(high_use = pormath$high_use, absences = pormath$absences)
##         absences
## high_use  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 26 27
##    FALSE 50 37 40 31 24 16 16  9 14  5  5  2  3  1  1  0  0  1  0  2  1  0  0
##    TRUE  13 13 16  7 11  6  5  3  6  6  2  3  4  1  6  1  1  1  1  0  1  1  1
##         absences
## high_use 29 44 45
##    FALSE  0  0  1
##    TRUE   1  1  0
# alcohol consumption and age
table(high_use = pormath$high_use, age = pormath$age)
##         age
## high_use 15 16 17 18 19 20 22
##    FALSE 63 74 62 52  7  1  0
##    TRUE  18 28 35 25  4  0  1
# alcohol consumption and going out with friends
table(high_use = pormath$high_use, goout = pormath$goout)
##         goout
## high_use  1  2  3  4  5
##    FALSE 19 82 97 40 21
##    TRUE   3 15 23 38 32

In these two-dimensional frequency cross tables we can see the questionnaire results as frequencies over the entire group of students for the two chosen variables in each table. We can see that 111 out of 370 (30%) students report high use of alcohol. A student’s ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.

# select variables freetime, absences, age, goout for joined summary statistics
myvars <- c("freetime", "absences", "age", "goout")
new_pormath <- pormath[myvars]
# print basic statistic summary
summary(new_pormath)
##     freetime        absences           age            goout      
##  Min.   :1.000   Min.   : 0.000   Min.   :15.00   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.: 1.000   1st Qu.:16.00   1st Qu.:2.000  
##  Median :3.000   Median : 3.000   Median :17.00   Median :3.000  
##  Mean   :3.224   Mean   : 4.511   Mean   :16.58   Mean   :3.116  
##  3rd Qu.:4.000   3rd Qu.: 6.000   3rd Qu.:17.00   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :45.000   Max.   :22.00   Max.   :5.000
# access library psych
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# create summary table
describe(pormath[ , c('freetime', 'absences', 'age', 'goout')], fast = TRUE)
##          vars   n  mean   sd min max range   se
## freetime    1 370  3.22 0.99   1   5     4 0.05
## absences    2 370  4.51 5.50   0  45    45 0.29
## age         3 370 16.58 1.18  15  22     7 0.06
## goout       4 370  3.12 1.13   1   5     4 0.06

We can see the minimum, first quartile, median, third quartile and maximum values of the freetime, absences, age and goout variables with the summary function. The summary table created with the describe() function also outputs the column number (vars), number of valid cases (n), standard deviation (sd) and standard error (se) for the variables.

Let’s draw bar plots for each variable.

## bar plots to describe distributions of the variables

# access the libraries tidyr, dplyr, ggplot2 and cowplot
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(cowplot)

# bar plots for each variable
g1 <- ggplot(pormath, aes(x = high_use, fill = sex)) + geom_bar()
g2 <- ggplot(pormath, aes(x = freetime, fill = sex)) + geom_bar()
g3 <- ggplot(pormath, aes(x = absences, fill = sex)) + geom_bar()
g4 <- ggplot(pormath, aes(x = age, fill = sex)) + geom_bar()
g5 <- ggplot(pormath, aes(x = goout, fill = sex)) + geom_bar()

# draw all bar plots
plot_grid(g1, g2, g3, g4, g5, labels="AUTO")

The bar plots show counts for each variable by gender. Approximately a third of the students report high alcohol consumption, and of those students the majority are male. Most of the students have an average (3) or high (4) amount of free time after school and approximately 0 to 14 absences. The majority of the students are between 15 and 18 years of age, and the gender distribution is roughly even. Going out with friends varies mainly from low (2) to very high (5).

Let’s draw box plots to see the relationships of the variables with alcohol consumption.

## box plots to describe distributions of the variables
# box plots of the variables

g6 <- ggplot(pormath, aes(x = high_use, y = freetime, col = sex)) + geom_boxplot() + ylab("freetime")
g7 <- ggplot(pormath, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("absences")
g8 <- ggplot(pormath, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age")
g9 <- ggplot(pormath, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("goout")

# draw boxplots
plot_grid(g6, g7, g8, g9, labels = "AUTO")

In the box plots the median marks the mid-point of the data and is shown by the line that divides the box into two parts: half the data points are greater than or equal to this value and half are less. The middle “box” represents the middle 50% of the data points for the analysis group. Outliers are shown as dots outside the boxes. The box plots display the distribution of the data: whether the data is symmetrical, how tightly the data is grouped and whether and how the data is skewed. The whiskers are the two lines outside the box that go from the minimum to the lower quartile (the start of the box) and from the upper quartile (the end of the box) to the maximum.

We can see from the box plots that:
High alcohol consumption and freetime
- females with low alcohol consumption show greater variability in free time and have less free time than males with low or high alcohol consumption or females with high alcohol consumption. Roughly, it seems that the amount of free time after school does not have much influence on alcohol consumption.
High alcohol consumption and absences
- the number of absences is higher for both females and males who report high alcohol consumption.
High alcohol consumption and age
- males with high alcohol use are a bit older than males with low alcohol consumption.
High alcohol consumption and going out with friends
- both female and male students who go out more with friends have higher alcohol consumption.

Based on these box plots, hypothesis 2 (students with more absences consume more alcohol than students with fewer absences) and hypothesis 4 (students who go out with friends consume more alcohol than students who go out less with their friends) could hold. Hypothesis 3 (older students consume more alcohol than younger students) might not hold for the whole group of students.

4. Fitting the logistic regression model

The logistic regression model is a generalized linear model (GLM). It can be used to assess the effects of a set of explanatory variables on a binary response variable. The response in the logistic regression formula is the log odds of the binary outcome being 1.
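
The "log odds" response can be made concrete with a minimal sketch on simulated data (not the course data): glm's predictions on the link scale are exactly the log odds of its predictions on the probability scale.

```r
# link-scale predictions of a logistic model are the log odds of the
# response-scale probabilities
set.seed(4)
x  <- rnorm(100)
p  <- plogis(-1 + 2 * x)                     # true probabilities
yb <- rbinom(100, size = 1, prob = p)        # binary outcome
fit <- glm(yb ~ x, family = "binomial")

log_odds <- predict(fit, type = "link")      # log odds scale
probs    <- predict(fit, type = "response")  # probability scale
all.equal(log_odds, log(probs / (1 - probs)))  # TRUE
```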

Let’s fit the logistic regression model. The binary high/low alcohol consumption variable is the target variable. In R, to fit a logistic regression, the glm function is used with the family parameter set to binomial.

# create logistic regression model
model <- glm(high_use ~ freetime + absences + age + goout, data = pormath, family = "binomial")

# print summary of the model
summary(model)
## 
## Call:
## glm(formula = high_use ~ freetime + absences + age + goout, family = "binomial", 
##     data = pormath)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7891  -0.7617  -0.5500   0.9668   2.3609  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.18857    1.85646  -2.795 0.005192 ** 
## freetime     0.18307    0.13719   1.334 0.182071    
## absences     0.07660    0.02239   3.421 0.000625 ***
## age          0.06957    0.10882   0.639 0.522605    
## goout        0.67244    0.12381   5.431 5.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 388.17  on 365  degrees of freedom
## AIC: 398.17
## 
## Number of Fisher Scoring iterations: 4

In the summary of the logistic regression model, Call shows the function and parameters used to create the model. Deviance is a measure of the goodness of fit of a generalized linear model. The null deviance shows how well the target variable is predicted by a model that includes only the intercept. The residual deviance shows how well the response variable is predicted by the model with the predictor variables.

High alcohol consumption and absences
Each unit increase in absences increases the log odds of high alcohol consumption by 0.0766, and the p-value indicates that absences is statistically significant in predicting high alcohol consumption.

High alcohol consumption and going out with friends
Each unit increase in going out with friends increases the log odds of high alcohol consumption by 0.672, and the p-value indicates that goout is statistically significant in predicting high alcohol consumption.

The variables freetime and age are not statistically significant, meaning that these variables do not have a statistically significant relationship with the binary target variable high/low alcohol consumption.

The estimated parameters of the logistic regression model can be interpreted in terms of odds and odds ratios. The exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs. no change) in the corresponding explanatory variable. Let’s print the coefficients of the model as odds ratios together with their confidence intervals.

# model with statistically significant variables
model2 <- glm(high_use ~ absences + goout, data = pormath, family = "binomial")

# print the coefficients of the model
coef(model2)
## (Intercept)    absences       goout 
## -3.63280208  0.07631808  0.73380079
# compute odds ratios (OR)
OR <- coef(model2) %>% exp

# compute confidence intervals (CI)
CI <- confint(model2) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their 95% confidence intervals
cbind(OR, CI)
##                     OR      2.5 %     97.5 %
## (Intercept) 0.02644199 0.01082308 0.06045431
## absences    1.07930582 1.03477610 1.13034717
## goout       2.08298255 1.66258410 2.64170298

The odds ratio (OR) of 1.08 with a 95% confidence interval of 1.035 to 1.13 for absences means that each additional absence is associated with a 3.5% to 13% increase in the odds of high alcohol consumption.

The odds ratio (OR) of 2.08 with a 95% confidence interval of 1.66 to 2.64 for the goout variable means that each one-unit increase in going out with friends is associated with 1.66 to 2.64 times the odds of high alcohol consumption.
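
The claim that exp(coefficient) equals the odds ratio for a one-unit change can be verified directly: a minimal sketch on simulated data (not the course data), comparing the model's odds at two x values one unit apart.

```r
# exp(coefficient) equals the ratio of odds between x + 1 and x
set.seed(5)
x  <- rnorm(200)
yb <- rbinom(200, size = 1, prob = plogis(0.5 * x))
fit <- glm(yb ~ x, family = "binomial")

# odds of the outcome at a given value of x, from the fitted model
odds <- function(x0) {
  p <- predict(fit, newdata = data.frame(x = x0), type = "response")
  p / (1 - p)
}

c(exp(coef(fit)["x"]), odds(3) / odds(2))  # the two values agree
```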

5. Interpreting the predictive power of the logistic regression model

Let’s explore the predictive power of the logistic regression model to see how well it actually predicts the target variable.

# fit the model
model2 <- glm(high_use ~ absences + goout, data = pormath, family = "binomial")

# predict the probability of high_use
probabilities <- predict(model2, type = "response")

# add the predicted probabilities to pormath
pormath <- mutate(pormath, probability = probabilities)

# use the probabilities to make a prediction of high_use
pormath <- mutate(pormath, prediction = probability > 0.5)

# tabulate the target variable versus the predictions (2x2 cross table)
table(high_use = pormath$high_use, prediction = pormath$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   247   12
##    TRUE     88   23
# initialize a plot of 'high_use' versus 'probability' in 'pormath'
g <- ggplot(pormath, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = pormath$high_use, prediction = pormath$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66756757 0.03243243 0.70000000
##    TRUE  0.23783784 0.06216216 0.30000000
##    Sum   0.90540541 0.09459459 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = pormath$high_use, prob = pormath$probability)
## [1] 0.2702703

The first 2x2 cross table is a confusion matrix of the predictions versus the actual values of high alcohol use. The plot gives a graphical visualization of the actual values and the predictions. The second cross table shows the same tabulation as proportions, with margins added. In both tables the predicted outcomes are the columns and the true outcomes are the rows: the diagonal elements show the correct predictions and the off-diagonal elements the incorrect ones.

The total proportion of inaccurately classified individuals (the training error) is about 0.27, meaning that the model classifies roughly 27% of the observations incorrectly.
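For comparison, the same loss function can be evaluated against trivial guessing strategies. The sketch below reconstructs the binary target from the marginal proportions in the cross table above (30% of the 370 students are high users), so it runs without the pormath data:

```r
# loss function (mean prediction error), as defined above
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# 259 low users (0) and 111 high users (1), matching the margins above
class <- c(rep(0, 259), rep(1, 111))

# guessing "low use" for everyone is wrong for every high user
loss_func(class, prob = 0)   # 0.3

# guessing "high use" for everyone is wrong for every low user
loss_func(class, prob = 1)   # 0.7
```

A useful model should beat both of these baselines on the training data.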


Clustering and classification

1. Structure of the data

We’re using the Boston dataset from the MASS package, which ships with R; details of the data source can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

The Boston dataset is a dataframe with 506 observations (rows) and 14 variables (columns). All the variables have numerical values.

Meaning of the variable abbreviations:
crim = per capita crime rate by town
zn = proportion of residential land zoned for lots over 25,000 sq.ft
indus = proportion of non-retail business acres per town
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox = nitrogen oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted mean of distances to five Boston employment centres
rad = index of accessibility to radial highways
tax = full-value property-tax rate per $10,000
ptratio = pupil-teacher ratio by town
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat = lower status of the population (percent)
medv = median value of owner-occupied homes in $1000s

## RStudio Exercise 4: Analysis
date()
## [1] "Sun Dec  5 19:15:29 2021"
# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# read in data
data("Boston")

## explore the dataset
# structure
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# dimensions (rows and columns of the data)
dim(Boston)
## [1] 506  14

2. Graphical overview of the data

Let’s print the summary statistics and a plot matrix of the Boston data.

# summary statistics
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# plot matrix of the variables
pairs(Boston)

Let’s draw a correlation matrix and a correlation plot.

# access libraries
library(MASS)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <-  cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Correlations between the variables can be calculated with the cor() function, which creates the correlation matrix of coefficients between variable pairs. The correlation plot drawn with the corrplot() function visualizes these correlations: the stronger the red color, the stronger the negative correlation, and the stronger the blue color, the stronger the positive correlation between the variables. The plot shows the strongest negative correlations for the indus-dis, nox-dis, age-dis and lstat-medv variable pairs, and the strongest positive correlations for the indus-nox, indus-tax, nox-age, rm-medv and rad-tax variable pairs.
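The strongest pairs listed above can also be picked out programmatically instead of reading them off the plot; a small sketch using base R:

```r
library(MASS)

# correlation matrix of the Boston data
cm <- cor(Boston)

# blank out the upper triangle and diagonal so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA

# flatten to (Var1, Var2, Freq) rows and sort by absolute correlation
pairs_df <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
head(pairs_df[order(-abs(pairs_df$Freq)), ], 5)
# the top pairs include rad-tax (0.91) and nox-dis (-0.77), as in the plot
```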

3. Editing the data for linear discriminant analysis

Here we scale the data with the scale() function, which subtracts the column mean from each column and divides the difference by the column’s standard deviation. Scaling makes sense here because the variables in the original Boston data are measured on very different scales; scale() centers and standardizes the columns of a numeric matrix.
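As a sanity check of what scale() does, the same standardization can be written out by hand for a single column:

```r
library(MASS)
data("Boston")

# manual standardization: (x - mean(x)) / sd(x)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# the corresponding column produced by scale()
auto <- scale(Boston)[, "crim"]

all.equal(unname(auto), manual)   # TRUE
```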

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Above, a categorical variable of the (scaled) crime rate was created using the quantiles as break points, and the old crime rate variable was dropped from the dataset. The dataset was then divided into train and test sets so that 80% of the data belongs to the train set. We use this split to check how well the model works: the model is trained on the train set and its predictions are evaluated on the test set, i.e. on new data.
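One practical note: sample() draws a random subset, so the split (and everything computed from it) changes between runs unless the random number seed is fixed first. A minimal sketch (the seed value 2021 is an arbitrary choice):

```r
# fixing the seed makes the 80/20 split reproducible
set.seed(2021)

n <- 506                               # rows in boston_scaled
ind <- sample(n, size = floor(n * 0.8))

length(ind)        # 404 rows go to the train set
n - length(ind)    # 102 rows remain for the test set
```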

4. Fitting the linear discriminant analysis on the train set

The MASS package provides a function for linear discriminant analysis (LDA) in R. LDA assumes that the variables are normally distributed and that the normal distributions of each class share the same covariance matrix; we scaled the data above to better satisfy these assumptions. LDA is a classification method: it calculates the probability that a new observation belongs to each of the classes and assigns the observation to the class with the highest probability. LDA finds the linear combinations of the variables that best separate the target variable classes.

Let’s fit the LDA on the train set and draw the LDA plot, the categorical crime rate being the target variable and all the other variables in the dataset as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2549505 0.2376238 0.2475248 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       0.9605778 -0.8937507 -0.122344297 -0.8669472  0.4536290 -0.8556430
## med_low  -0.1169566 -0.2355425  0.033465125 -0.5457455 -0.1384331 -0.3229070
## med_high -0.3782608  0.2105687  0.260819923  0.4530886  0.0972569  0.4205726
## high     -0.4872402  1.0149946  0.003267949  1.0344956 -0.4218576  0.8280654
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8700574 -0.6953013 -0.7454004 -0.43608720  0.38296388 -0.78475947
## med_low   0.3020064 -0.5615100 -0.4755616 -0.01937294  0.31861497 -0.10953406
## med_high -0.4090673 -0.4160119 -0.3041965 -0.38699600  0.07178924  0.04524294
## high     -0.8467313  1.6596029  1.5294129  0.80577843 -0.70432364  0.86680905
##                  medv
## low       0.521339400
## med_low  -0.004939346
## med_high  0.188295902
## high     -0.625719386
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09538841  0.754198583 -0.90490874
## indus    0.05404608 -0.178930356  0.29721872
## chas    -0.05255782 -0.060338416  0.14639998
## nox      0.21087862 -0.832583555 -1.32120635
## rm      -0.14704053 -0.097316794 -0.17496221
## age      0.25280457 -0.188904941 -0.19603335
## dis     -0.08380742 -0.304821537  0.09226649
## rad      3.73507325  0.949650630 -0.11279153
## tax      0.04005150 -0.075937806  0.50489433
## ptratio  0.11296447  0.108578993 -0.14085908
## black   -0.14092030  0.008421991  0.09977029
## lstat    0.18787340 -0.393768145  0.31921122
## medv     0.18885411 -0.419071960 -0.17299000
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9577 0.0315 0.0108
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

5. Prediction of values based on the LDA model

Let’s predict the categories with the LDA model on the test data. First, the crime categories from the test set are saved and the categorical crime variable is removed from the test dataset.

# save the correct crime categories from test dataset
correct_classes <- test$crime

# remove the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)

# predict the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results with the crime categories from the test set
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      10        0    0
##   med_low    4      18        1    0
##   med_high   1      14       13    2
##   high       0       0        1   26

The prediction results with the crime categories from the test set are shown in the cross table above. In the table the predicted outcomes are the columns and the true outcomes are the rows. The diagonal elements of the table show the correct predictions and the off-diagonal elements show the incorrect predictions. We can calculate from the cross table that 67.6% (69/102) of the predictions are correct.

6. Distance measures and K-means clustering

Distances can be measured with base R’s dist() function, which creates a distance matrix saved as a dist object; the distance matrix is usually a square matrix containing the pairwise distances between the observations. Similarity (nearness) is determined using a distance metric, a function that measures how far two records are from one another.

To measure the Euclidean distance between two vectors, one vector is subtracted from the other, the differences are squared and summed, and the square root of the sum is taken. Another common distance metric for numeric data is the Manhattan distance. Euclidean distance corresponds to the straight-line distance between two points, while Manhattan distance is the distance between two points traversed along one axis at a time. Different distance measures produce different results.
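The two metrics are easy to check by hand on a pair of toy vectors (values chosen only to keep the arithmetic simple):

```r
# two toy observations
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Euclidean: square root of the sum of squared differences
sqrt(sum((x - y)^2))                      # 5

# Manhattan: sum of absolute differences
sum(abs(x - y))                           # 7

# dist() gives the same values
dist(rbind(x, y))                         # 5
dist(rbind(x, y), method = "manhattan")   # 7
```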

Clustering is a technique for dividing data into groups where the records in each group are similar to one another; the goal is to identify significant and meaningful groups in the data. K-means divides the data into K clusters by minimizing the sum of the squared distances of each record to the mean of its assigned cluster. K-means does not ensure the clusters have the same size, but it finds the clusters that are best separated. It is typical to standardize continuous variables first, because otherwise variables with a large scale dominate the clustering.

Let’s standardize the Boston dataset and calculate the distances.

# Reload the Boston dataset and standardize the dataset 

# load MASS and Boston
library(MASS)
data('Boston')

# standardize the dataset 
boston_standardized <- scale(Boston)

## calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_standardized)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_standardized, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Let’s run the k-means algorithm on the Boston dataset. K-means needs the number of clusters as an argument; let’s choose 3 to start with. Let’s also visualize the clusters with the pairs() function.

# k-means clustering
km <- kmeans(boston_standardized, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_standardized, col = km$cluster)

The clusters seem to overlap somewhat in the plot. To find the optimal number of clusters, let’s calculate the total within-cluster sum of squares (WCSS) for different cluster counts and plot it against the number of clusters. The optimal number of clusters is where the total WCSS drops radically.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standardized, k)$tot.withinss})

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

We can see from the plot that the total WCSS drops radically when the number of clusters is 2. Let’s rerun k-means with 2 clusters and visualize them with the pairs() function.

# k-means clustering
km <-kmeans(boston_standardized, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_standardized, col = km$cluster)

After repeating the clustering with the optimal number of clusters, 2 differentiated clusters (in black and red color) are now shown in the plot.


Dimensionality reduction techniques

1. Overview of the data

The original data is from the United Nations Development Programme and information of the data can be found via these links: http://hdr.undp.org/en/content/human-development-index-hdi http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/datasets/human_meta.txt

The selected variable names have been shortened and two new variables have been computed. Observations with missing values have been removed, observations representing regions instead of countries have been removed, the original Country variable has been dropped and the countries have been set as the row names of the dataframe. The dataframe now has 8 variables (columns) and 155 observations (rows).

Meaning of the variable abbreviations:
ratio_sec_edu = ratio: proportion of females with at least secondary education / proportion of males with at least secondary education
ratio_labour = ratio: proportion of females in the labour force / proportion of males in the labour force
life_exp_birth = life expectancy at birth
exp_years_edu = expected years of schooling
GNI_per_capita = Gross National Income per capita
mat_mort_ratio = maternal mortality ratio
adol_birth_rate = adolescent birth rate
repr_parl = percentage of female representatives in parliament

## RStudio Exercise 5: Analysis
date()
## [1] "Sun Dec  5 19:15:43 2021"
# read in human data
human <- read.table('~/IODS-project/data/human.csv')

# first six observations of the human data
head(human)
##             ratio_sec_edu ratio_labour life_exp_birth exp_years_edu
## Norway          1.0072389    0.8908297           81.6          17.5
## Australia       0.9968288    0.8189415           82.4          20.2
## Switzerland     0.9834369    0.8251001           83.0          15.8
## Denmark         0.9886128    0.8840361           80.2          18.7
## Netherlands     0.9690608    0.8286119           81.6          17.9
## Germany         0.9927835    0.8072289           80.9          16.5
##             GNI_per_capita mat_mort_ratio adol_birth_rate repr_parl
## Norway               64992              4             7.8      39.6
## Australia            42261              6            12.1      30.5
## Switzerland          56431              6             1.9      28.5
## Denmark              44025              5             5.1      38.0
## Netherlands          45435              6             6.2      36.9
## Germany              43919              7             3.8      36.9
# structure of the human data
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ ratio_sec_edu  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ ratio_labour   : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ life_exp_birth : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ exp_years_edu  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI_per_capita : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mat_mort_ratio : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adol_birth_rate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ repr_parl      : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# basic statistics of the human data
summary(human)
##  ratio_sec_edu     ratio_labour    life_exp_birth  exp_years_edu  
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##  GNI_per_capita   mat_mort_ratio   adol_birth_rate    repr_parl    
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Let’s visualize the data with the ggpairs() and corrplot() functions.

# access GGally
library(GGally)
# access dplyr
library(dplyr)
# access corrplot
library(corrplot)

# visualize the human data variables
ggpairs(human)

The plot matrix drawn with ggpairs() shows scatter plots of each pair of numeric variables in the lower-left part of the plot, the Pearson correlations in the upper-right part, and the variable distributions on the diagonal. The strongest positive correlations are between the exp_years_edu - life_exp_birth and adol_birth_rate - mat_mort_ratio variable pairs; the strongest negative correlations are between the mat_mort_ratio - life_exp_birth and mat_mort_ratio - exp_years_edu variable pairs. Looking at the diagonal, the distributions of GNI_per_capita, mat_mort_ratio, adol_birth_rate and repr_parl are right-skewed, the distributions of ratio_sec_edu, ratio_labour and life_exp_birth are left-skewed, and exp_years_edu is roughly normally distributed.

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot

To study linear connections, the correlations can also be computed with the cor() function and then visualized with the corrplot() function from the corrplot package. The variables run along the diagonal from top left to bottom right, and each variable is plotted against every other; the boxes in the upper-right of the plot mirror those in the lower-left. The stronger the red color, the stronger the negative correlation, and the stronger the blue color, the stronger the positive correlation between the variables. The same correlation patterns are visible here as in the plot matrix above, but this visualization makes the differences easier to distinguish.

Correlation does not imply causation, but it seems that countries with higher life expectancy at birth also have more expected years of schooling, and that a higher adolescent birth rate goes together with a higher maternal mortality ratio. It also seems that as maternal mortality increases, life expectancy at birth and expected years of schooling decrease.

2. Principal component analysis (PCA)

Principal component analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original set. PCA is a technique for discovering the way in which numeric variables covary, and singular value decomposition (SVD) is the most important tool for reducing the number of dimensions in multivariate data. The first principal component (PC1) captures the maximum amount of variance of the features in the original data. The second principal component (PC2) is orthogonal to the first and captures the maximum amount of the remaining variability, and the same holds for each subsequent component: they are all uncorrelated and each is less important than the previous one in terms of captured variance.
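The decreasing-variance property is easy to see on a small simulated dataset (a sketch; the variables and the seed are arbitrary):

```r
set.seed(1)

# a and b are strongly related, c is independent noise
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.3), c = rnorm(100))

pca_toy <- prcomp(toy, scale. = TRUE)

# the standard deviations of the components decrease from PC1 onward
pca_toy$sdev

# and the cumulative proportion of variance rises to 1
cumsum(pca_toy$sdev^2) / sum(pca_toy$sdev^2)
```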

Let’s perform PCA on the non-standardized human data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

The biplot displays the observations by the first two principal components (PC1 on the x-axis, PC2 on the y-axis), and the arrows represent the original variables.

The PCA method is sensitive to the variances of the initial variables: if there are large differences between the ranges of the variables, those with larger ranges dominate over those with small ranges, which leads to biased results. Here PC1 appears to explain practically all of the variance simply because the PCA was performed on non-standardized data, in which GNI_per_capita has by far the largest scale; the results are therefore biased.
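The scale effect is easy to reproduce with simulated data: without standardization, PC1 is essentially whichever variable has the largest variance (a sketch; the names and the seed are arbitrary):

```r
set.seed(42)

# 'big' has a standard deviation 1000 times that of 'small'
d <- data.frame(small = rnorm(50), big = rnorm(50, sd = 1000))

# unscaled PCA: the loading of 'big' on PC1 is very close to 1
pca_raw <- prcomp(d)
abs(pca_raw$rotation["big", "PC1"])

# scaled PCA: the variance is shared between the components
pca_std <- prcomp(d, scale. = TRUE)
summary(pca_std)$importance["Proportion of Variance", ]
```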

3. Repeating the PCA after standardizing the data

The point of standardization is to rescale the continuous initial variables so that each of them contributes equally to the analysis. By transforming the data to comparable scales before the PCA, we can prevent biased results.

Let’s standardize the data and then repeat the PCA.

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##  ratio_sec_edu      ratio_labour     life_exp_birth    exp_years_edu    
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##  GNI_per_capita    mat_mort_ratio    adol_birth_rate     repr_parl      
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human2 <- prcomp(human_std)

# create and print out a summary of pca_human
s2 <- summary(pca_human2)
s2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percentages of variance captured by each PC
pca_pr2 <- round(100*s2$importance[2,], digits = 1) 

# print out the percentages of variance
pca_pr2
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")

# draw a biplot
biplot(pca_human2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])

After standardizing the data and rerunning the PCA, we can see that the variance is now spread over a larger number of PCs. In the biplot, the dimensionality of the human data is reduced to two PCs: PC1 captures 53.6% and PC2 16.2% of the total variance in the original 8 variables. The observations are plotted at the coordinates defined by the two PCs, and the arrows visualize the connections between the original variables and the PCs.

The angle between the arrows representing the original variables can be interpreted as the correlation between the variables (a small angle indicates a high positive correlation), and the angle between a variable and a PC axis can be interpreted as the correlation between the two (again, a small angle indicates a high positive correlation). The lengths of the arrows are proportional to the standard deviations of the variables.
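The angle interpretation rests on a standard identity: for standardized data, the correlation of a variable with a PC equals that variable's loading times the PC's standard deviation. A numpy sketch on hypothetical toy data (not the course data) can check this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
# two correlated standardized variables plus an independent one (toy data)
a = rng.normal(size=300)
b = 0.9 * a + 0.45 * rng.normal(size=300)
c = rng.normal(size=300)
Z = np.column_stack([a, b, c])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # z-score each column

# PCA via SVD of the standardized matrix
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                # principal component scores
sdev = s / np.sqrt(Z.shape[0])   # standard deviations of the PCs (divisor n)

# correlation of each variable with PC1...
corr = np.array([np.corrcoef(Z[:, j], scores[:, 0])[0, 1] for j in range(3)])
# ...equals loading * PC standard deviation
print(corr)
print(Vt[0] * sdev[0])
```

The two printed vectors agree, so a short arrow (small loading times small sdev) really does mean a weak correlation with that dimension.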

We can see from the biplot that the expected years of schooling, GNI per capita, life expectancy at birth and ratio of secondary education variables correlate with each other and point in the same direction as PC1; these variables contribute to the same dimension. The variables maternal mortality ratio and adolescent birth rate correlate with each other and point in the direction of PC2, contributing to that dimension.

4. Multiple correspondence analysis (MCA)

Let’s explore the tea data.

library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)
# read in tea data
data(tea)

## explore the data
# structure of the data
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# dimensions of the data (rows and columns)
dim(tea)
## [1] 300  36

Let’s choose six variables for a new tea_time dataset.

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# structure of the data
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# dimensions of the data (rows and columns)
dim(tea_time)
## [1] 300   6

Let’s visualize the tea_time dataset.

# visualize the tea_time dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The tea_time dataframe has 6 variables (columns) and 300 observations (rows). The barplots show the distribution (answer frequencies) of each factor (categorical) variable. The data concern a questionnaire on tea.

Multiple correspondence analysis (MCA) is also a dimensionality reduction method: it analyzes the pattern of relationships among several categorical variables. It is a method for qualitative data and an extension of correspondence analysis (CA). MCA can be used both to detect patterns or structure in the data and to reduce its dimensionality.
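Conceptually, MCA is a correspondence analysis of the one-hot indicator matrix of the categorical variables. Stepping outside the course's R workflow, a numpy sketch on a tiny made-up dataset (hypothetical values, only two variables) shows the computation that FactoMineR's MCA() performs under the hood:

```python
import numpy as np

# toy categorical data (hypothetical; 2 variables with 3 and 2 levels)
tea_kind = np.array(["black", "green", "green", "black",
                     "earl", "earl", "black", "green"])
where = np.array(["shop", "store", "shop", "store",
                  "store", "shop", "store", "shop"])

def indicator(col):
    """One-hot encode one categorical variable."""
    levels = sorted(set(col))
    return np.array([[v == lev for lev in levels] for v in col], dtype=float)

# MCA = correspondence analysis of the full indicator matrix
Z = np.hstack([indicator(tea_kind), indicator(where)])
P = Z / Z.sum()                         # correspondence matrix
r = P.sum(axis=1)                       # row masses
c = P.sum(axis=0)                       # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
sv = np.linalg.svd(S, compute_uv=False)
eig = sv**2                             # principal inertias (eigenvalues)
print(np.round(100 * eig / eig.sum(), 1))  # % of variance per dimension
```

A useful sanity check: for an indicator matrix with J categories across Q variables, the total inertia is J/Q - 1 regardless of the data (here 5/2 - 1 = 1.5), which is why MCA eigenvalue percentages are typically modest, as in the summary below.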

Let’s perform MCA with the MCA() function on the tea_time data.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

The MCA summary contains eigenvalues, individuals, categories and categorical variables. The eigenvalues table gives the variance and the percentage of variance retained by each dimension. The individuals table gives the individuals' coordinates, their contribution (%) to each dimension and their cos2 (squared correlations) on the dimensions. The categories table gives the coordinates of the variable categories, their contribution (%), their cos2 and the v.test values. The categorical variables table gives the squared correlation (eta2) between each variable and the dimensions.

The MCA biplot shows the variable categories on the first two dimensions; the distance between categories gives a measure of their similarity, and colors represent the different variables. The plot helps to identify the categories most correlated with each dimension: for example, the answers tea shop and unpackaged are the most correlated with dimension 1, and other with dimension 2. Dimension 1 captures 15.24% and dimension 2 14.23% of the total variance in these six variables.


(more chapters to be added similarly as we proceed with the course!)